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Abstract 

Background: Genes occurring co-localized in multiple genomes can be strong indicators for either functional 
constraints on the genome organization or remnant ancestral gene order. The computational detection of these 
patterns, which are usually referred to as gene clusters, has become increasingly sensitive over the past decade. The 
most powerful approaches allow for various types of imperfect cluster conservation: Cluster locations may be internally 
rearranged. The individual cluster locations may contain only a subset of the cluster genes and may be disrupted by 
uninvolved genes. Moreover cluster locations may not at all occur in some or even most of the studied genomes. The 
detection of such low quality clusters increases the risk of mistaking faint patterns that occur merely by chance for 
genuine findings. Therefore, it is crucial to estimate the significance of computational gene cluster predictions and 
discriminate between true conservation and coincidental clustering. 

Results: In this paper, we present an efficient and accurate approach to estimate the significance of gene cluster 
predictions under the approximate common intervals model. Given a single gene cluster prediction, we calculate 
the probability to observe it with the same or a higher degree of conservation under the null hypothesis of 
random gene order, and add a correction factor to account for multiple testing. Our approach considers all 
parameters that define the quality of gene cluster conservation: the number of genomes in which the cluster 
occurs, the number of involved genes, the degree of conservation in the different genomes, as well as the 
frequency of the clustered genes within each genome. We apply our approach to evaluate gene cluster 
predictions in a large set of well annotated genomes. 



Background 

Gene order-based analysis of whole genomes has become 
an important field in comparative genomics. It is well 
known that genomes evolve, not only at the level of 
nucleotide sequence, but also by means of large-scale rear- 
rangements, such as inversions and transpositions, as well 
as changes of the gene content. Focusing on this higher- 
level structure, genomes are usually modeled as sequences 
of integers with genes belonging to the same gene family 
represented by the same integer. If no selective pressure 
was acting on whole genome evolution, gene order and 
gene content would randomize over time. In reality, parti- 
cularly in bacterial genomes, we observe low overall con- 
servation of gene order between species, contrasted by a 

* Correspondence: kjahn@cebitec.uni-bielefeld.de 

'Genome Informatics, Faculty of Technology, Bielefeld University, 33615 

Bielefeld, Germany 

Full list of author information is available at the end of the article 



small number of well-conserved segments. These are often 
referred to as gene clusters. Such local aberrations from 
genome randomization provide evidence for various bio- 
logical phenomena and are of high interest in functional 
and evolutionary genomics [1-9]. 

When comparing a large number of genomes, the iden- 
tification of these structures can be a challenging task 
since conservation patterns may be highly variable across 
species. Due to micro-rearrangements, gene order can 
also vary between cluster occurrences. Gene insertions 
and losses, or mis-annotations may lead to cluster occur- 
rences interrupted by genes that do not belong to the 
cluster, or containing only a subset of the clustered 
genes. Moreover, a gene cluster may be present only in a 
subset of the genomes under study. (A gene cluster dis- 
playing all these features, except for mis-annotations, is 
shown in Additional File 1) Taking these effects into 
account in different ways, several models of gene clusters 
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and algorithms for their detection have been suggested 
[10-17]. 

However, it may as well be the case that seemingly con- 
served patterns occur merely by chance. To estimate the 
likeliness of such events, appropriate statistics are needed 
to quantify the probability of finding gene clusters by 
chance. Such statistics have been developed for some 
gene cluster models, in particular r-windows [18-20], 
segmental homologies (fe-clumps) [21], and max-gap clus- 
ters [22-24] . Other methods solve the problem of assign- 
ing significance to predicted gene clusters in an ad-hoc 
manner, including C-Hunter [25], OrthoCluster [26], 
MCMuSeC [16], CYNTENATOR [27], and i-ADHoRe 
[28]. In this paper we consider the gene cluster model of 
approximate common intervals [15]. This can be easily 
applied to a variety of use cases, as it offers a combina- 
tion of parameter flexibility and efficiency of computa- 
tion, even for very large data sets. The variant reference 
gene clusters [17] proves especially useful. We provide a 
statistical test for evaluating gene cluster predictions 
against the null hypothesis of random gene order. For 
our background model, we consider, for each genome G, 
a random string S of the same length where each charac- 
ter (gene) has a probability proportional to its frequency 
in Q. For multi-chromosomal genomes, or in cases where 
the (unfinished) genome sequence consists of multiple 
contigs, we do the same for each chromosome/contig 
individually. We then estimate p-values, that is, the prob- 
abilities of gene clusters of the observed quality or higher 
being found in the random genomes by chance. 

Since the random genomes are drawn independently, we 
can proceed as follows: For each genome, we compute the 
likelihood of a gene cluster occurring in the corresponding 
random genome. These are the individual p-values for 
each genome. Next, we demonstrate how to combine 
p-values from individual genomes into one p-value for the 
gene cluster. The problem of multiple testing is then con- 
sidered by applying a false discovery rate control to mini- 
mize this effect. Finally, we demonstrate that there is 
excellent concurrence between our calculated p-values 
and empirically determined p-values and that the method 
is able to recognize known gene clusters from large geno- 
mic data sets. 

Methods 

Preliminaries 

We model a genome as a string over a finite alphabet 
E = {1, . . . , a} of gene family ids, such that genes 
belonging to the same homology family are represented by 
the same integer. In the following, we use the terms 
"genome" and "string" and, also, "gene" and "character" 
interchangeably. Given a string S, we denote its length by 
|S| and refer to its jth character as S[i], 1 < i < |S|. A char- 
acter c € E is said to occur at position [ in S if S[i] = c. 



A substring of S from position a to b is denoted S[a, b] for 
1 < a < b < |S|. To capture the character content of 
S[a, b] regardless of sequential arrangement and multiple 
character occurrences, we define the character set of 
S[a, b] as 

CS (S [a, b]) := {S [i] : a < i < b] c £ . 

The corresponding index interval [a, b] is termed a loca- 
tion of C C E if and only if C = CS(S[a, b]). An interval 
[a, b] is left-maximal if a = 1 or S[a — 1] £ CS(S[a, b]); it 
is right-maximal if b = |S| or S [b + l] ^ CS(S[a, b\); and 
it is maximal if it is both left- and right-maximal. 

Given k > 2 strings Si, ... Sfe, fe maximal intervals 
written as fe-tuple {[a\, b\], . . . , [a&, bk]) are called com- 
mon intervals in Si, . . . S& if and only if 

CS{S l \a l M\) = ... = CS{S k \a k ,h\) =: C. 

Given that Si, . . ., Sj. are genomes, the above charac- 
ter set C corresponds to a gene cluster with perfectly 
conserved gene content. 

In order to model gene clusters with incomplete con- 
servation patterns, we quantify the differences in the 
gene content of their approximate gene cluster occur- 
rences via their symmetric set distance. This measure 
defines the distance between two finite sets C and C as 
the cardinality of their symmetric difference: 

d(c,C) := \c\C\ + \C\c\ = \cuc'\-\cnc'\. 

This constitutes a metric and therefore meets all intui- 
tive notions of a distance measure, such as validity of the 
triangle inequality. In the context of gene clusters, it cor- 
responds to a simple summation over the genes deleted 
and the genes inserted into a cluster occurrence. Like 
earlier character set based gene cluster models [29], it 
disregards recurrences of genes within the same cluster 
occurrence. 

Based on this distance notion, we extend the concept of 
character set locations towards approximate conserva- 
tion: Given an integer S > 0, we define an interval [a, b] in 
a string S as a ^-location of character set C, if and only if, 
D(C, CS{S[a, b])) < S, and C n CS{S[a, b\) £ 0. 

Let Si, . . ., Sfe be genomes over a gene alphabet £. Let 
s > 2 be the minimum cluster size, S > 0 a distance 
threshold and fe' a quorum parameter with 2 < fe' < h A 
reference gene cluster for parameters s, S and fe' is a set of 
genes C C E with |C| > 5 such that C has an exact occur- 
rence in one of the genomes and 5-locations in at least 
]i — l other genomes. In other words, there exist i, a, b 

such that C = CS{Si[a, b]), and / c {1, fe} - {i} with 

l/l > fe' — 1 such that each Sj has a 5-location of C for all 

In [17] we studied the following problem: Given gen- 
omes Si, . . ., Sfe and parameters s , 8, fe', find an reference 
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gene clusters C C T, in Si, . . . , Sk, and for each reference 
gene cluster, output all its optimal 5-locations. (A 
S-location is not optimal if it is a subinterval of a 
S-location that has a smaller distance to C.) We intro- 
duced an efficient algorithm that runs in 
0(k 2 n 2 8 2 + k 2 n 2 ) time using 0(kn 2 ) space, where n is the 
length of the largest genome [17]. The algorithm is exact, 
meaning that it is guaranteed to find all reference gene 
clusters and their optimal occurrences as specified by the 
search parameters. 

The above definitions do not take into account multi- 
chromosomal genomes, or genomes that were not com- 
pletely assembled and still consist of several contigs. 
However, it is simple to extend these definitions, as well 
as the remainder of this paper, to the multi-chromosomal 
or multi-contig case. For example, we may assume that 
the different chromosomes/contigs of one genome are 
concatenated in a single string, separated by symbols 
$ ^ E. We can then assume that neither a gene cluster 
nor an interval is allowed to contain the character $. 
Further details will be omitted, aside from saying that 
some of the complexity bounds mentioned below actually 
improve for multi-chromosomal and multi-contig 
genomes. 

Significance of a gene cluster for one genome 

In this section we estimate the probability of a fixed 
gene cluster C C £ having a 5-location in a random gen- 
ome S of length n, i. e. the p-value of finding an occur- 
rence of C in genome S : 

p-value = P(S has a 5-location of C) (1) 

In the following we assume that S < \C\ — 2 holds. 
Otherwise the p-value is equal to one whenever 
C n CS(S) ± 0, and zero otherwise. Let p{L, d) = p{L, d, C) 
be the probability that a random occurrence of length L 
has a symmetric set distance exactly d to C. Let 
cj(L, 8) = q{L, 8, C) be the probability that it has a sym- 
metric set distance of at most 8. Note that p{L,d) and 
cf{L, 8) depend on the cluster C; in the following, our nota- 
tions omit this dependency for the sake of readability. 

Then, q{L, 8) = p(L, d\ To simplify our computa- 

tion, we assume that occurrence probabilities are indepen- 
dent for all intervals [a, b] where 1 < a < b < n. Clearly, 
this assumption is not correct: Let A be the event that 
interval [i, j] forms a 5-location of C, and let B be the event 
that interval [i, j + l] forms a 5-location of C. The fact that 
the intervals share all positions but one creates a number 
of non-trivial dependencies. In case S\j + 1] e CS(S[i, j]), 
the two events are either both true or both false. When 
the distance between CS(S[i, ;]) and C is smaller than 
8, p[B\A) = 1, and vice versa. Such dependencies apply not 



only to [i, j] and [i,j +1], but to all intersecting interval 
pairs. However, we will show later on that the p-values 
computed under this assumption are very close to the true 
p-values, for any realistic choices of parameters. We esti- 
mate the p-value for a single genome as 

P(S has a <5-locations of C) « 1 - J~[ (1 - q{b - a + 1,8)) 

(2) 

=i- n 

L=l,...,n 

To minimize the effects of rounding error accumula- 
tion, we instead compute 

PfShas a ^location of C) » 1 - exp ^ ^ (n - L+ 1) ■ log (1 - q(L, S))j (3) 

which can be calculated with high accuracy, using 
mathematical library functions for f(x) := exp(x) — 1 
and g(x) := \og(x +1). 

Exact computation using dynamic programming 

We need to compute p{L, d), the probability that the 
character set of a random interval of length L has a sym- 
metric set distance d to a given (fixed) gene cluster C, for 
all L and d < 8. Let Si be a random string of length L, and 
let P(c) denote the probability of character c € E for any 
position of the random string. For a sub-alphabet £' C £, 
set P(E') := E ceS 'P(c) The distance d between C and 
CS (Si) can be partitioned as d = d- + d+: Here, d- is the 
number of characters from C that are missing in Si, and d+ 
is the number of additional characters in CS(Si). Conse- 
quently, we can partition the positions in Si into two 
types: those positions containing characters from C, and 
those positions containing characters from C := J2 ~ C- 
Assume that | positions of Si are occupied by characters 
from C, 0 < I < L, and that L — I positions are occupied 
by characters from Q. We calculate 
d 

p(L, <i)-£P(|C\CS(Si)| -d-A\CS(S L )\C\ -d-d.) 

d.=0 

d L 

= EE P ( ||i|s il'l eC ' 1 -' 5L|1 -1a|C\CS(Si)| -i- A |OS(Si)\C| =d-d.) (4) 

d.=0 t=0 

- £ I>(U). p-(l, <!-). P*(L - 1, <i - 

1=0 d.=0 

where po(l,L) is the probability of drawing a random 
string of length L with / characters from C and L — / char- 
acters from C; p~ (I, dS) is the probability that, for a ran- 
dom string Si of length / over the alphabet C, CS(Si) is 
missing d- characters from C; and p + (L — l,d— d_) is the 
probability that, for a random string S/< of length J' = l — | 
over the alphabet C, CS(Si>) contains d + = d — d_ different 
characters. If all these values are known, we can compute 
the desired probability using (4) in time 0(dL). In practice, 
we found that p~(l,d-) in (4) decreases rapidly with 
increasing \. To this end, we can stop the summation, as 
well as the computation of p~(l, d-) and p + (L — I, d+), as 
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d 

soon as y p 0 (f,L), p~(l,dS). p + (L — I, d+) no longer con- 

<L=o 

tributes to the sum. 

Computing po(l,L) is straightforward, using the bino- 
mial distribution: One can see that 

p 0 (u) = (^-p(cy-p(c) L -' 

holds. It remains to be shown how to compute 
p~(l,d-) for missing characters and p + (L — l,d+) for 
additional characters. 
Missing characters 

We first demonstrate how to compute p~(l,d-), the 
probability that a random string S; of length J over the 
alphabet C is missing d- characters from C. Let 
h := \C\ — d- be the number of hits from CS(Si) in C. For 
readability, we base our computation on the number of 
hits h. The order of the characters in the random string is 
not relevant, so we simply check whether a certain char- 
acter from C has been generated. The statistical equiva- 
lent is rolling dice. We assume, for simplicity, that 
C := {1, ... ,Z\. Probabilities of the characters in C are 
conditional probabilities of the same characters in £. We 
define 

p[z,l,h] := P(h different outcomes for /rolls for #'s 1, ...,z). 

So, p\z, I, h] is the probability that, by throwing | dice 
with numbers 1, ... , z, exactly h different numbers have 
been rolled. In the following, we not only iterate over the 
number of rolled dice and different outcomes, but also 
over the numbers that can be rolled. In cases where only 
numbers 1, ... ,z out of 1, ... ,Z can be rolled, we can 
calculate the conditional probability of each outcome 
with the recurrence: 

p[z, I, h\ = P(no z rolled with / dice) ■ p[z - 1, 1, h] 

^ (5) 
+ 22 W times z rolled with Z dice) • p[z - 1, / - I, h - 1]) 

We initialize p[z, 0, 0] = 1, p[z, 0, h] = 0, p[z, I, 0] = 0, 
and p[0, l,h] = 0 for all z and all I, h > 0. The two missing 
probabilities are computed using a binomial distribution. 
In the end, p~(l, d-) = p[Z, I, h] is the probability that in 
a random string of length I over alphabet C, exactly h dif- 
ferent characters from C have been generated. Computa- 
tion takes 0(|C|W 2 ) time and 0(hl) space, as only the 
values for z = Z need to be stored. 
Additional characters 

The recurrence introduced in equation (5), can also be 
used to compute p + (l,d + ). We need to set h := d+, and 
exchange the roles of C and Q. Unfortunately the latter 
modification has a strong impact on the practical runtime, 
due to the linear dependence now being on the much 



larger |C| (compared to the rather small |C| for p~[l, d-)). 
However, we can mitigate this by pooling genes based on 
the frequency of their occurrences in the original genome. 
The genes within each pool have the same occurrence 
probability in the random genome and need not be distin- 
guished in our calculations. It is sufficient to know for 
each pool how many of its genes originate from C, and C 
respectively. 

Let / be the number of different occurrence frequencies 
observed in the original genome, and let Fi, ■ ■ ■ ,Ff denote 
the corresponding gene pools. Given a fixed gene cluster C, 
we denote by j£ the subset of pool F ZI 1 < z < f, that 
consists only of genes from Q We then modify recurrence 
(5) as follows: 

p[z,t,h] = T(no gem of Ff rolled with I dice)' 'p[z- l,lh] 
I mln(«Jf|) 

+ E T. (?(£genesofFC,h'diffenietones,roUedu>ithldiceyp[z- 1,1 - t,h -h'}) 

The initializations are the same as for the previous 
recurrence. We can use the binomial distribution to com- 
pute the value of P (no gene of rolled with | dice) and 
the same type of recurrence as in equation (5) to com- 
pute the second probability, ¥(£ genes of F c , h' different 
ones, rolled with | dice). In the end, p + (l, d + ) = p\f, I, h] is 
the probability that in a random string of length | over 
alphabet C, exactly h different characters from £ have 
been generated. 

Due to the second summation, the asymptotic time 
complexity of the recurrence becomes 0(fh 2 l 2 )- We 
observe that, in practice /, the number of different gene 
occurrence frequencies is very small compared to Q and is 
typically in the size range of large gene clusters. Also, the 
vast majority of genes occur only once in a genome, with 
pool sizes dropping quickly for larger occurrence frequen- 
cies. Since h' is bounded, not only by h but also by \F^\, 
the quadratic dependence on h is unlikely to be relevant in 
practice. 

Next, we show how the values of ¥(£ genes of F% , h' dif- 
ferent ones, rolled with | dice) can efficiently be computed. 
Ideally, we would like to precompute these values once for 
every genome, providing constant-time lookup during 
computation of p-values for the different gene clusters. At 
first sight, these probabilities seem to be specific for each 
cluster, as the gene pools need to be restricted to their 
complement Q However, we know that all genes in a pool 
have the same occurrence probability, therefore it is suffi- 
cient to compute the above values for all residual pools, 
after removing a certain number - not a certain set - of 
genes. For small pools, which are in the majority, not 
much extra work is required to do this for all possible resi- 
dual sizes. For large pools exact computations may, in 
practice, be too costly. In this case, we suggest working 
with pools based on the complete alphabet £. Due to their 
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size, the removal of a few elements has little influence on 
the conditional probabilities of the remaining elements. 

In practice, we use a faster, but less exact approach: We 
replace the more accurate estimation of p + (l,d + ) with a 
much faster preprocessing that leads to almost identical 
results. To this end, we compute a global P + (l, d + ) for the 
complete alphabet, C := E, during preprocessing. This is 
achieved using recurrence (5). We then assume that, for 
any cluster C with additional character probabilities 
p + (l,d+), we have p + (l,d+) ^P + (l,d + ). In doing so, we 
ignore the fact that the gene cluster C removes some of 
the genes from the pool of potential additional genes. 
Clearly, this computation can be carried out very quickly, 
as we have to compute P + (I, d + ) only once for each 
genome. Depending on the distribution of occurrence 
probabilities, the above approximation can distort the 
results significantly. We account for this by setting 
p + {l,d + ) w (1 -P(C)) d+ -P + {l,d + ), thereby taking into 
account the occurrence probabilities of the genes in C. 

Significance of a gene cluster in multiple genomes 

Thus far, we have concentrated on the probability of 
observing gene cluster occurrences in a single genome. 
To estimate the significance of observing a gene cluster 
in multiple genomes, we need to combine the individual 
probabilities into a single p-value. This gives the prob- 
ability of observing a gene cluster C, at least as well 
conserved in the randomized genomes as in the original 
genome set. 

We begin by formalizing the notion of "at least as well 
conserved". Consider the case where a <S-location of C is 
detected in all genomes Si, . . . , S&. To simplify notation, 
we assume that Si, . . . , S& are the remaining genomes 
after removing the reference genome, the one from 
which we took the interval to generate C. Clearly no 
p-value estimation is necessary for this genome, and it 
can be omitted from the following calculations. Let 
d = [d\, . . . ,dk) be a distance vector, such that d, is 
the distance between C and its best approximate occur- 
rence in genome S,, 1 < i < fe. We denote by d ODS the 
distance vector observed for C and the original genomes. 
To make different distance configurations comparable, 
we need to define a linear order of all possible distance 
vectors. We chose an ordering based on the total 
distance, d\. We denote by d Q bs the sum distance of 
d 0 b s . To exclude configurations with 5-locations in fewer 
genomes than observed in the original data, we further 
require that each individual distance in the vector is at 
most 8. To calculate the probability of observing any dis- 
tance vector that satisfies the above constraints, we define 
the following recurrence: 

min(iJ) 

M[i,d]= £ (P(besti-locationmSjhasdistance<i'toC)-M[;- 1, d — <f]). (7) 



The base cases are M[0, 0] = 1, and M[0, d] = 0 for 
d > 0. P(best S-locations in S, has distance d' to C) equals 
P(C has a ^'-location in S,) - P(C has a (d 1 - relocation 
in S,). These probabilities can be computed with equa- 
tion (2). Summing over all M[k, d\ 0 < d < d 0 b s , gives 
the desired p-value. i. e. the probability that C is at least as 
well conserved in the randomized genomes as in the origi- 
nal dataset. This computation takes time 0(k 2 8 2 ), as i and 
d are bounded by fe and d 0 b s < respectively, and we have 
dobs <k - 8. 

When a gene cluster is observed only in a subset of the 
studied genomes, it becomes tricky to define the meaning 
of "at least as well conserved": Is a gene cluster better con- 
served if it occurs in many genomes at a low quality, or in 
fewer genomes but at higher quality? We suggest that the 
latter should be given preference. Otherwise, there is the 
risk of systematically preferring gene clusters that occur in a 
large number of genomes, yet only incompletely in the form 
of one or two genes, over clusters that are very well con- 
served but only in a small number of genomes. We believe 
that the latter are the more interesting cases. Therefore we 
say a gene cluster C with 5-locations in fe' out of k genomes 
and sum distance d 0 b s is conserved at the same or better 
quality in the randomized genomes, if it has 8 -locations in 
at least fe' of them, and the best fe' 8 -locations (from fe differ- 
ent genomes) have a sum distance of at most d Q b s to C. 
Unfortunately the recurrence used in equation (7) cannot 
be extended to compute the corresponding probabilities. 
We need to track the sum of the U smallest distances below 
8, after processing the first i < k genomes. This value cannot 
be computed in a simple recursive manner, as there is no 
optimal substructure underlying the problem: The sum, 
after processing i genomes, depends not only on the sum 
before the genome was added and the distance for the new 
genome, but also on the previous number of distances 
below ,5 and the values of these distances. Only with all of 
this information is it possible to decide whether or not the 
distance encountered for the new genome needs to be 
added to the sum. In the absence of an efficient dynamic 
programming approach, we need to sum probabilities over 
all distance vectors. This becomes infeasible for larger 8- 

In order to avoid exponential running time, we studied 
a simpler approach where we use a fixed distance thresh- 
old 8' for all genomes. This can be either the original 
threshold 8, or the largest entry in d 0 t, s , i. e. the largest of 
the distances between C and its best 5-location in each 
genome. 

As a consequence, we do not need to sum over pair- 
wise distances in the recurrence, but over the number 
of genomes that contain a ^'-location. Let Pj be the like- 
lihood of a ^'-location in genome Sj, computed using 
equation (2). The likelihood of having ^'-locations in 
exactly i out of j genomes Si, . . . ,Sj, 1 < i < j < k, can 
be computed using the recurrence 
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Q [hi] = (i - Pi) ■ Qlii - i] + Pi ■ Q [i - hi - i]. (8) 

The base cases are Q[0, 1] = 1 - pi, Q[l, 1] = pi, and 
Q[i,j] = 0 for all i > j. The likelihood of finding at least k 1 
^'-locations in fe genomes is then the sum over all Q[i, k] 
with k' < i < k. Computation requires 0(fe 2 ) time. This 
method will be referred to as "global distance bound". 

Unfortunately, the above approach has the following 
problem: Consider two (otherwise identical) gene clus- 
ters, both found in three genomes. The first cluster is 
found with distances 0, 1 and 4 in the three genomes; the 
second cluster with distances 3, 4 and 4. Common sense 
tells us that the first gene cluster is more significant, 
because it is less likely to occur by chance. However, the 
approach described above will come up with identical 
likelihoods, as, in both cases we have $' = 4. In fact, for 
the first cluster it may be beneficial to exclude the last 
occurrence, as this may lead to a smaller p-value; we 
omit the details. 

To ensure that gene clusters of this type are evaluated 
differently while keeping the computational complexity 
reasonable, we resort to the following simplification: 
Recall that 8' < 8 is the maximum distance of the cluster 
to any occurrence. For genomes where we do not detect 
a 5-location, we use the single-genome p-value with 
distance threshold <$'; for genomes other than the refer- 
ence genome that contain a S-location (and, hence, a 
^'-location), we use the single-genome p-value with dis- 
tance threshold given by the distance of the detected 
interval in this genome. In the above example, for the 
first cluster, we use distance threshold 0 for the first 
genome, 1 for the second genome, and 4 for the third 
genome; for the second cluster we use distance thresh- 
old 3 for the first genome and 4 for the remaining two. 
This method will be referred to as "individual distance 
bounds". 

False discovery rate control 

Since we are testing significance not only for a single 
gene cluster, but for the complete set of gene cluster pre- 
dictions reported by a search extending over all possible 
reference intervals, we need to adjust our p-values 
accordingly. We use false discovery rate (FDR) control 
[30] to counteract the problem of multiple testing. In 
detail, we sort all the clusters by their p-value and multi- 
ply the p-value p of any gene cluster with the number of 
possible reference intervals in all fe genomes divided by 
the index j in the sorted cluster list: 

k nj.[n j+ l) 



where n\, . . . ,nu are the genome lengths. This is a 
conservative estimation and comes at the cost of 



increasing the probability of producing false negatives. 
That is, gene clusters that should be regarded as signifi- 
cant may be declared non-significant after the FDR 
correction. In addition, equation (9) actually overesti- 
mates the number of gene clusters tested as we do not 
take into consideration certain gene clusters that appear 
multiple times in a genome (and should be tested only 
once). We do not use the more powerful Sidak correc- 
tion [31], as independence of the different tests cannot 
be guaranteed. FDR-corrected p-values can be larger 
than one, which solely indicates that this correction is 
conservative. 

Evaluation of p-value accuracy for a single genome 

We now evaluate the accuracy of the p-value estimation 
we introduced above for a single genome. Recall that we 
use two simplifications to keep this task computationally 
feasible. First, we assume that the intervals within a gen- 
ome do not overlap, in order to gain statistical indepen- 
dence of the probabilities cj{L,8) in equation (2). Second, 
we employ a heuristic approach to deal with additional 
characters in cluster occurrences. To show that our cal- 
culations are still sufficiently accurate, we compare our 
estimated p-values with p-values derived empirically 
from large-scale simulations of random genomes. 

Two simulation studies were performed, one with biolo- 
gical data and one with simulated data. To reduce simula- 
tion time, we performed the first study on four, somewhat 
small, bacterial genomes with just 600 to 850 genes each, 
namely Buchnera aphidicola APS, Ureaplasma urealyti- 
cum, Mycoplasma pneumoniae, and Borrelia burgdorferi. 
We downloaded these genomes from the NCBI database 
[32] and assigned homology families, using GHOSTFAM 
[33] with standard parameters. Ten billion random 
sequences were then generated, with the same length and 
character frequencies as the original genomes. 

Our reference gene cluster detection algorithm [17] was 
applied to the four original genomes, with three different 
parameter settings (8 = 1, s = 4), (5 = 3, s = 6), and 
(,5 = 5, s = 9). The quorum parameter was set to U = 2 in 
each case. This search returned 459 gene clusters. We 
computed for each gene cluster C, the p-value of each 
occurrence, excluding the reference interval. This was 
done with equation (2), by setting the threshold to 8', the 
symmetric set distance between C and the character set of 
the occurrence. Next, the p-value for a genome with ran- 
domized gene order containing a ^'-location of C was 
empirically determined. To this end, all random genomes 
corresponding to the genome where the occurrence was 
observed were searched for intervals with a symmetric set 
distance of, at most, 8' to C- The number of genomes with 
at least one such occurrence was divided by the total num- 
ber of tested genomes. This gives an empirical estimate of 
the true p-value. When no occurrences were found, we 
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omitted the pair from further analysis, as this shows the 
empirical p-value to be out of the scope of the current 
evaluation. This resulted in the omission of 277 gene clus- 
ters. The maximum p-value calculated for any such 
omitted pair was 9.23 • 10" 11 . Therefore all of the omitted 
values fall into the 95% Agresti-Coull confidence interval 
[34] of finding no occurrences, that has an upper bound of 
4.64 • 1CT . (Note that when searching for gene clusters, 
what matters in p-value estimation is the order of magni- 
tude. For example, p-values 2.7 • 1(T 380 and 5.4 • 1(T 380 
differ by a factor of two; still, we would regard a gene clus- 
ter with either of these p-values as highly significant). 

For the remaining 182 gene clusters, we have plotted 
the calculated p-values against the empirical p-values in 
Figure 1. Unless our independence assumption is 
severely violated, this should result in points close to 
the main diagonal. Note that both the calculated 
p-values and the empirical p-values can deviate from the 
true p-value; this estimate becomes highly inaccurate, 
particularly for very small empirical p-values, as only 
few of the billions of genomes contain an occurrence of 
the gene cluster. 

Nevertheless, we find an excellent agreement between 
the calculated p-values and the empirical p-values. For 
numerical comparison of these values, we transform 
both into the log scale: Otherwise, almost all p-values 
are far too small to contribute to Pearson correlation or 




10 11 10 10 10" 9 10" 8 10" 7 10" 6 10" 5 10" 4 10" 3 
calculated p-value 



Figure 1 Comparison of empirical and calculated p-values 
for bacterial genomes. Comparison of empirical and calculated 
p-values for gene clusters from four bacterial genomes (log-log 
plot, jV = 182)- Pearson correlation of log-log-pairs is 
r = +0.9976775 (r 2 = 0.9953604) coefficient of 
determination of log-log-pairs is R 2 = 0.9951661 (identity in 
red). Error bars visualize the 95% Agresti-Coull confidence interval. 



coefficient of determination. The Pearson correlation of 
the log-log pairs is r = +0.9976775 (r 2 = 0.9953604). 
As we want to use the calculated p-values as a predictor 
of the empirical p-values, we also computed the coeffi- 
cient of determination 

R i = 1 _ E,(k-*0 2 
E.(y.-y) 2 

where the Yi are the observations (log empirical 
p-values), the Xi are the predictions (log calculated 
p-values), and y is the mean of the observations. For the 
bacterial genomes, we achieve R 2 = 0.9951661- The 
observed deviations for small probabilities must be attribu- 
ted to the fact that, for these reference gene clusters, ten 
billion random genomes are not enough to give a good 
estimate of the true value. It appears very likely that the 
empirical p-value is inaccurate, rather than the calculated 
p-value. 

We argue that these results strongly indicate that our 
calculated p-values are very close to the true values; 
hence, although equation (2) does not take into account 
statistical dependencies, our calculations are still highly 
accurate. 

The number of p-value pairs in the above study is rela- 
tively small and not sufficient to firmly conclude that our 
calculations are accurate. To obtain a greater degree of 
certainty, we performed a second study using random 
genomes. Here, random genomes of different sizes and 
with different character distributions were generated. In 
order to create random genomes with similar characteris- 
tics to true biological data, we studied the gene family 
size distribution in real genomes. (A complete list of the 
genomes can be found in Additional file 1). Gene family 
size appears to roughly follow a heavy-tailed distribution 
(Additional file 1). The Pareto distribution was therefore 
selected for simulating genomes. The probability density 
function is: 

CiX^ 1 

f(x) = -JZL, a ,k> 0, x > x m . (10) 

In the following calculations, we use x m = 1, so that 
each gene appears at least once. The bacterial genomes 
we use later on for our evaluation approximately follow 
a Pareto distribution with a = 2.8 (Additional file 1). 

For each random genome we uniformly draw its length 
n G [1250, 1750]. To select the character content of the 
genome, we repeat the following: We choose the next 
character and draw the number of occurrences of this 
character in the random genome using the above Pareto 
distribution. We repeat this until all n positions of the 
random genome are filled, discarding surplus copies of 
the last added character. In this manner, we generated 
ten genome contents. 
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To generate a random gene order, we could concate- 
nate the genes and shuffle the resulting string. To speed 
up computations, we proceeded in a slightly different 
way: Instead of generating a random genome and then 
searching for reference gene clusters, we simply assume 
a gene cluster C to be present. To obtain useful p-values 
we combined different strategies to construct C- (1) nine 
clusters are chosen with two to ten genes using the most 
commonly occurring genes; (2) nine clusters are chosen 
with two to ten genes using the rarest genes, usually 
occurring only once; (3) 82 clusters are chosen with two 
to ten genes, randomly selected. For each of these 100 
gene clusters we randomly choose a maximum allowed 
distance 5 e [0, |C| — 2]. As mentioned earlier, for higher 
values of 5 exact p-values can be easily computed, simply 
by testing whether a single gene from C is present in the 
genome. 

We then proceeded as described above, comparing the 
calculated p-value to an empirical p-value obtained from 
one billion random draws. From the 10 • 100 = 1000 gene 
clusters tested, we omitted 249 p-value pairs where no 
occurrences were empirically observed. The maximum 
calculated p-value for any such omitted pair is 2.45 • 10~ , 
while the upper bound of the 95% Agresti-Coull confi- 
dence interval for observing no occurrence is 4.64 • 10~ 9 . 
To further increase the accuracy of the empirical estima- 
tion of the 49 clusters that were found between one and 
ten times only, we did an additional nine billion random 
draws. For five clusters, we still observed less than ten 
occurrences; the largest calculated p-value among these 
clusters is 5.40 • 10~ 10 , the upper bound of the 95% 
Agresti-Coull interval for observing nine occurrences is 
1.74 • 10~ 9 . As for these cases the empirically determined 
p-values are presumably inaccurate, we excluded them 
from our analysis. 

For the remaining 746 gene clusters, the calculated 
p-values vs. empirical p-values are plotted in Figure 2. 
Again, we see an excellent agreement between calcu- 
lated and empirical p-values: Pearson correlation of the 
log-log pairs is r = +0.99989 (r 2 = 0.999783), and the 
coefficient of determination of the log-log pairs is 

R 2 = 0.99975- 
Results 

To evaluate our method, we used a dataset of 119 bacter- 
ial genomes from the RefSeq database [32]. A complete 
list of the genomes can be found in Additional File 1. 
This dataset has previously been used by Ling et al. [16] 
for the evaluation of the MCMuSeC software; we 
removed 14 plasmids. A partitioning of the genes of 
these genomes into homology families is available based 
on COG (Clusters of Orthologous Groups) numbers [35]. 
We deliberately ran our analysis on this set of well- 
described bacteria to facilitate evaluation of our cluster 
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calculated p-value 

Figure 2 Comparison of empirical and calculated p-values for 

simulated data. Comparison of empirical and calculated p-values 

for random gene clusters in random genomes (log-log plot, 

M = 746)- Pearson correlation of log-log-pairs is 

r = + 0.99989 (r 2 = 0.999783} coefficient of 

determination of log-log-pairs is jj 2 = q 99975 (identity in red). 
1 ! , 



predictions based on the annotations provided in the 
database. 

We searched for reference clusters with Mycobacter- 
ium tuberculosis CDC1551 (refSeq NC_002755) 
as the reference genome, which contains 4189 
genes. We used five different combinations of 5 and s, 
namely (3 = 0,5 = 4), (5=1,5 = 5), (5 = 2,5 = 6), 
(3 = 3, 5 = 7) and (5 = 5, 5 = 8). The quorum parameter 
was set to fe' = 10 in each case. Finding the gene clusters 
took about 9.3 minutes on a laptop computer (run as a 
single thread on an Intel i5 M520 processor, 2.40 GHz, 
8 GB RAM). For multiple genome p-value calculation, 
we applied the "individual distance bounds" method. 
Computing p-values for the resulting 582 gene clusters 
(including duplicates) required 1.2 minutes. The p-values 
were FDR-corrected for the 8, 775, 955 intervals in the M. 
tuberculosis genome. Gene cluster lists were merged and 
duplicate occurrences removed. For gene clusters where at 
least one occurrence, in one of the genomes, intersected, 
we report only the one with the smaller p-value. This 
resulted in 63 gene clusters. The best 20 are shown in 
Table 1; the complete list can be found in Additional File 1. 
The gene cluster with the best p-value (3.24 • 10~ after 
FDR correction) is found in 108 out of the 119 genomes; it 
contains nine genes with functional annotations linked to 
the 50S and 30S ribosomal subunits. By contrast, the sec- 
ond most significant gene cluster appears in 1 14 genomes 
and shows a much higher degree of conservation with 
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Table 1 Top 20 gene cluster predictions 
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The first 20 gene clusters found when searching Mycobacterium tuberculosis CDC1551 against 118 bacterial genomes. Clusters are sorted by p-values, computed 
using the "individual distance bounds" method. "G" is the number of different genes in the reference gene cluster; "GN" is the number of genomes where the 
reference gene cluster is found; "distance to ref." indicates the observed distances between the reference gene cluster and its occurrences. The "p-score" is the 
negative log 10 of the p-value, before and after FDR correction. 



regards to inserted and deleted genes. However, it contains 
just seven genes (functional annotation is also 50S/30S 
ribosomal subunits); its p-value is 6.61 • 1(T 1252 . The cluster 
in position four of the list (NADH dehydrogenase) is only 
contained in 57 genomes. The p-value, of 2.68 • 10~ 891 , is 
still very low as it contains nine genes and a low average 
distance of 1.4. 

We have also computed p-values using the "maximum 
distance bound" method, see Additional file 1. The best 
scoring cluster, in both cases, is part of the 30S/50S ribo- 
somal subunit, with nine conserved genes. However, using 
the "maximum distance bound" method, the best scoring 
occurrence of this cluster is only found in 66 genomes, 
with a p-value of 1.19 • 1(T 946 , while the occurrence in 108 
genomes only receives a p-value of 1.43 • 1CT 714 due to its 
maximum distance of only five. The p-value for the 66 
genomes occurrence using the "individual distance 
bounds" method is 2.37 • 1(T 947 , while the 108 genomes 
occurrence receives a p-value of 3.24 • 10~ 1308 . 

None of the clusters in the above experiments have a 
corrected p-value anywhere close to 0.05, the typical 
threshold used to discriminate significant from non- 
significant observations. To get some insight into this 
"grey area" where no confident predictions can be made, 
we studied gene cluster predictions with a corrected p- 
value close to 0.05. We obtained these in three runs of 



our program, (5 = 3, 5 = 1, k' = 9), (s = 4, 5=4, Id = 8) 
and (5 = 6, 5 = 7, k' = 7). For each setting we collected 
all predictions with a p-value > 0.05 (corrected p-score < 
-1.3), and also the same number of predictions with the 
biggest p-values < 0.05. The complete list of these 84 pre- 
dictions can be found in Additional file 1. We compared 
the predictions with known E. coli operons that we 
obtained from the RegulonDB database [36]. As can be 
seen in Additional file 1 most of the 84 predictions above 
and below the threshold contain at least one operon. 
A more formal analysis of these findings is hard to obtain 
with the limited data available. What appears to be a 
false positive prediction based on RegulonDB, may in fact 
be an unknown gene cluster, or one that is not well 
enough confirmed to appear in the database. Also a non- 
significant prediction, that is in fact a confirmed operon, 
does not mean that our p-values are too strict. Statistical 
significance by itself is simply not a necessary condition 
for a biological gene cluster. 

Conclusion and outlook 

In this paper, we presented the first statistical model to 
estimate the significance of gene cluster predictions 
under an approximate common interval-based gene clus- 
ter model. The underlying p-value calculations integrate 
all parameters that influence the quality of gene cluster 
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conservation. Namely, the number of genomes in which 
the gene cluster is detected, the size of the gene cluster, 
the degree of conservation within the different occur- 
rences, as well as the genome-wide frequencies of the 
genes involved. To keep the computation time feasible, 
we had to make some simplifying assumptions in the 
p-value calculation, but we have experimentally shown 
that our estimations are remarkably close to empirically 
derived p-values. An analysis of a set of well annotated 
genomes has proven that our method is able to re- 
discover known, highly conserved gene clusters with 
p-values clearly showing that such conservation did not 
occur by chance. The gene cluster at position 20 in our 
output list (functional annotation: arginine biosynthesis) 
still receives a highly significant p-value of 5.20 • 10~ 177 . 
We also studied clusters with low significance and 
observed known operons with p-values below the signifi- 
cance threshold of 0.05, as well as unknown clusters with 
significant p-values. However, due to the limited data 
available it was not possible to distinguish between false 
and true positives. 

Although the simplifying assumptions seem to have 
little effect in practice, it would be an interesting next 
step to study more accurate models in the future. In 
particular, the assumption that the probabilities of 
observing a cluster are statistically independent in over- 
lapping intervals could be omitted. This could be 
achieved by accounting for such dependencies in our 
calculations, for example by employing the inclusion/ 
exclusion principle. Alternatively our present p-value 
approximation might be amenable to control by the 
Chen-Stein method [37]. 

Another strong assumption in our model is that any 
two genomes show fully randomized gene order, unless 
evolutionary pressure prohibits it. This assumption may 
be violated if we analyze genomes of closely related spe- 
cies or strains. In this case, significances will be overrated 
by our p-value estimation. For the dataset analyzed in 
this study at least, this problem is less severe than one 
might expect. Even strains of the same species show a 
relatively high amount of random shuffling (Additional 
file 1). Nevertheless, by lowering the quorum parameter 
to fe' = 5, we observe that many conserved regions are 
detected, where all or most species are Mycobacteria. In 
cases where more closely related strains are analyzed, we 
suggest several workarounds in Additional file 1. In the 
future, it will be an interesting problem to include 
incomplete randomization into our statistical model. 

Additional material 

Additional file 1: (PDF) 
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